
Optimize large OCL zip file import performance#119

Open

dkayiwa wants to merge 1 commit into master from optimize-large-zip-import

Conversation

@dkayiwa
Member

@dkayiwa dkayiwa commented Mar 2, 2026

Summary

Optimizes the OCL zip file import pipeline for large files (e.g., DiagnosesStarterKit with ~5000 concepts and ~10000 mappings). The main bottleneck was excessive per-item database queries during import.

Changes

1. Item URL Cache in CacheService (biggest win)

  • Added in-memory HashMap cache for getLastSuccessfulItemByUrl() results (including null/not-found)
  • Cache persists across clearCache() calls since items are used for metadata only (uuid, versionUrl, state)
  • Items created during concept phase are cached for instant lookup during mapping phase
  • Eliminates ~35,000+ redundant DB queries for a typical large import
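The lookup path described above could be sketched like this. This is only an illustration: the field names `itemsByUrl` and `checkedItemUrls` come from the PR, but the `Item` stub and the `dbLookup` hook are placeholders, not the module's real API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Placeholder for the module's Item entity; cached items are read for metadata only.
class Item {
    final String uuid;
    Item(String uuid) { this.uuid = uuid; }
}

class ItemUrlCache {
    // Persists across clearCache() calls: items are used for metadata only.
    private final Map<String, Item> itemsByUrl = new HashMap<>();
    // Remembers URLs already checked so null (not-found) results are not re-queried.
    private final Set<String> checkedItemUrls = new HashSet<>();

    Item getLastSuccessfulItemByUrl(String url, Function<String, Item> dbLookup) {
        if (checkedItemUrls.contains(url)) {
            return itemsByUrl.get(url); // a hit, or a remembered not-found (null)
        }
        Item item = dbLookup.apply(url); // at most one DB query per distinct URL
        checkedItemUrls.add(url);
        if (item != null) {
            itemsByUrl.put(url, item);
        }
        return item;
    }

    // Called after saving a concept/mapping so the mapping phase can find it instantly.
    void cacheItem(String url, Item item) {
        itemsByUrl.put(url, item);
        checkedItemUrls.add(url);
    }
}
```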

2. Skip DB Item Lookups for First-Time Imports

  • Detects first-ever import (getImportsInOrder returns ≤1 result)
  • Sets skipDbItemLookups flag on CacheService to skip getLastSuccessfulItemByUrl() DB queries
  • For first imports, no previous items exist so DB queries always return null
  • Flag is scoped only to Item URL lookups — ConceptMap and other entity lookups always query the DB on cache miss, since those entities may exist from non-OCL sources
  • Eliminates ~15,000+ guaranteed-null DB queries on initial import
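A sketch of the first-import detection and how the flag might gate the DB fallback. The names `skipDbItemLookups` and `getImportsInOrder` follow the PR description; the lookup hook and the simplified `String` cache are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class FirstImportAwareCache {
    private final Map<String, String> itemsByUrl = new HashMap<>();
    private boolean skipDbItemLookups = false;

    // A first-ever import sees at most one entry in getImportsInOrder():
    // the in-progress import itself.
    static boolean isFirstImport(int importCount) {
        return importCount <= 1;
    }

    void setSkipDbItemLookups(boolean skip) { this.skipDbItemLookups = skip; }

    String getLastSuccessfulItemByUrl(String url, Function<String, String> dbLookup) {
        String cached = itemsByUrl.get(url);
        if (cached != null) {
            return cached;
        }
        if (skipDbItemLookups) {
            return null; // first import: the DB query would always return null
        }
        return dbLookup.apply(url);
    }
}
```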

3. ConceptMap Cache in CacheService

  • Routes getConceptMapByUuid() through CacheService instead of direct ImportService call
  • Caches results to avoid repeated lookups for the same UUID within and across batches
  • Reduces redundant ConceptMap DB queries during mapping phase
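The ConceptMap path might look like this (a sketch: `Object` stands in for the module's ConceptMap type and `importServiceLookup` is a placeholder hook). Note the contrast with the item cache: a miss always falls through to the DB, since concept maps can exist from non-OCL sources.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class ConceptMapCache {
    private final Map<String, Object> conceptMapsByUuid = new HashMap<>();

    Object getConceptMapByUuid(String uuid, Function<String, Object> importServiceLookup) {
        Object cached = conceptMapsByUuid.get(uuid);
        if (cached != null) {
            return cached;
        }
        // Always query on a miss: the map may have been created outside OCL imports.
        Object fromDb = importServiceLookup.apply(uuid);
        if (fromDb != null) {
            conceptMapsByUuid.put(uuid, fromDb);
        }
        return fromDb;
    }
}
```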

4. Validation Type Determined Once Per Import

  • Determines ValidationType once at the start of processInput() and passes it through to saveConcept()
  • Defaults to FULL; only overridden if a subscription exists and has an explicit validationType set
  • The 3-arg saveConcept() overload (used by tests/other callers) retains the original per-call getSubscription() fallback for backward compatibility
  • Eliminates ~5,000 redundant getSubscription() calls during the import path
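The resolution rule can be sketched in a few lines (the enum values mirror the PR description; the resolver class and its signature are made up for illustration):

```java
// Sketch of resolving the validation type once per import run.
enum ValidationType { FULL, NONE }

class ValidationTypeResolver {
    // subscriptionType is null when there is no subscription, or when the
    // subscription has no explicit validationType set.
    static ValidationType resolveOnce(ValidationType subscriptionType) {
        return subscriptionType != null ? subscriptionType : ValidationType.FULL;
    }
}
```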

5. Subscription Lookup Consolidation

  • Looks up subscription once at the start of processInput() instead of 3 separate calls
  • Minor optimization but eliminates redundant DB queries per import

6. Increased Batch Size (256 → 512)

  • Reduces the number of flush/clear/reload cycles during import
  • Fewer cache rebuilds and session management overhead
  • Conservative increase (not 1024) to balance performance with Hibernate session memory usage, since each concept carries names, descriptions, mappings, etc. and OpenMRS deployments vary widely in available memory
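The arithmetic behind the change is simple: with N items and batch size B, the session is flushed ceil(N/B) times, so doubling B roughly halves the flush/clear cycles. A sketch (the helper name is made up):

```java
class BatchMath {
    // Number of flush/clear cycles for itemCount items at a given batch size.
    static int flushCycles(int itemCount, int batchSize) {
        return (itemCount + batchSize - 1) / batchSize; // ceiling division
    }
}
```

For a ~15,000-item import this works out to 59 cycles at 256 versus 30 at 512.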

7. Reduced Logging Overhead

  • Changed per-item log.info() to log.debug() for concept/mapping import messages
  • Keeps error logging at ERROR level
  • Eliminates ~15,000 log entries with full object toString() calls

Estimated Performance Improvement

For a large zip file with ~5000 concepts and ~10000 mappings:

  • ~60-70% reduction in total database queries (from ~100,000+ to ~30,000-40,000)
  • ~45-55% total wall-clock time improvement (DB queries are the dominant cost)

Testing

All 87 existing tests pass (0 failures, 0 errors; the 8 skipped tests are pre-existing).

@dkayiwa dkayiwa force-pushed the optimize-large-zip-import branch 2 times, most recently from 76e960a to 9bedf5a on March 2, 2026 at 13:20

CacheService cacheService = new CacheService(conceptService, oclConceptService);

// For zip file imports (no subscription), use NONE to skip expensive validation since OCL data is pre-validated.
Member

This seems like a mistake. OpenMRS validation and OCL validation may not always be consistent or change at the same rate between versions. I would recommend that we keep API validation here.

Member Author


Good point — you're right that OCL and OpenMRS validation may diverge, and skipping API validation could let inconsistent data through. I've reverted this to default to ValidationType.FULL for zip imports (no subscription). The validation type is still determined once upfront to avoid repeated getSubscription() calls per concept. Pushed the fix.

Member Author


@mseaton FWIW, the above comment was automatically made by the agent after I prompted it with "Respond to mseaton's review comment on the pull request" :)

@dkayiwa dkayiwa force-pushed the optimize-large-zip-import branch 9 times, most recently from 5ea96bd to d98cafa on March 2, 2026 at 15:01
Performance optimizations for importing large OCL zip files:

1. Item URL cache in CacheService (~40-50% DB query reduction)
   - Cache getLastSuccessfulItemByUrl() results across batch cycles
   - Cache items created during concept phase for instant mapping lookups
   - Track checked URLs to avoid re-querying nulls

2. Skip DB lookups for first-time imports
   - When no previous imports exist, all item URL lookups return null
   - Skip thousands of unnecessary DB queries by detecting first import

3. NONE validation for zip imports (10-20x faster per concept save)
   - For zip imports (no subscription), use ValidationType.NONE
   - Bypasses expensive duplicate name checking in conceptService.saveConcept()
   - OCL data is pre-validated, making full validation redundant

4. Larger batch size (256 -> 1024)
   - Reduces batch overhead (flush/clear/re-preload cycles)

5. Reduce per-item logging from INFO to DEBUG
   - Eliminates ~15,000+ log entries with full object toString()

Estimated total improvement: ~75-80% faster for large zip files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dkayiwa dkayiwa force-pushed the optimize-large-zip-import branch from d98cafa to 58c82a4 on March 2, 2026 at 15:03
Member

@ibacher ibacher left a comment


I like this a lot! Please feed my frustration with AI comments to your coding agent.

Comment on lines +197 to +201
// Note: itemsByUrl and checkedItemUrls are intentionally NOT cleared here.
// They must persist across flush/clear cycles so the mapping phase can look up
// concept items saved earlier without DB queries. For ~15,000 items this retains
// ~15K Item objects + String keys in memory, which is acceptable. The entire
// CacheService instance is GC'd when the import run completes.
Member

Suggested change
// Note: itemsByUrl and checkedItemUrls are intentionally NOT cleared here.
// They must persist across flush/clear cycles so the mapping phase can look up
// concept items saved earlier without DB queries. For ~15,000 items this retains
// ~15K Item objects + String keys in memory, which is acceptable. The entire
// CacheService instance is GC'd when the import run completes.
// Note: itemsByUrl and checkedItemUrls are intentionally NOT cleared here,
// so that the mappings can look up concept items saved earlier. The entire
// CacheService instance is GC'd when the import run completes.

Comment on lines +162 to +165
* Caches an item by its URL for fast lookup during the import.
* Called after successfully saving a concept or mapping to make it
* available for subsequent lookups (e.g., mapping phase looking up concept items)
* without a database query.
Member

Suggested change
* Caches an item by its URL for fast lookup during the import.
* Called after successfully saving a concept or mapping to make it
* available for subsequent lookups (e.g., mapping phase looking up concept items)
* without a database query.
* Caches an item by its URL for fast lookup during the import.

Comment on lines +175 to +177
* When set to true, skips database lookups in getLastSuccessfulItemByUrl() for URLs
* not already in the cache. Used for first-time imports where no previous items exist,
* eliminating thousands of DB queries that would all return null.
Member

Suggested change
* When set to true, skips database lookups in getLastSuccessfulItemByUrl() for URLs
* not already in the cache. Used for first-time imports where no previous items exist,
* eliminating thousands of DB queries that would all return null.
* When set to true, skips database lookups in getLastSuccessfulItemByUrl() for URLs
* not already in the cache.

* Gets the last successful item for a given URL. Results are cached to avoid
* repeated database queries for the same URL across batches and import phases.
* The cache persists across clearCache() calls since items are used for metadata only.
* This searches across all previous imports to find if this URL was previously imported.
Member

I actually think that the comment here is less helpful. Fewer words are faster to read.


// Cache for item URL lookups - persists across clearCache() calls since items are used for metadata only
private final Map<String, Item> itemsByUrl = new HashMap<>();
private final Set<String> checkedItemUrls = new HashSet<>();
Member

I think this could use an explanation and, more precisely, why it wouldn't be ok to just use itemsByUrl with a null value (since that's pretty much how HashSet is implemented).
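For what it's worth, the distinction being raised can be illustrated in a few lines: with a single `HashMap`, `get()` returning `null` is ambiguous between "never checked" and "checked, not found", and `containsKey()` (or a separate Set) is what disambiguates. A small demo (class and method names are made up):

```java
import java.util.HashMap;
import java.util.Map;

class NullAmbiguityDemo {
    // With a single map, get() == null could mean "never checked" or
    // "checked and not found"; containsKey() tells the two apart.
    static String describe(Map<String, String> cache, String url) {
        if (!cache.containsKey(url)) {
            return "never checked";
        }
        return cache.get(url) == null ? "checked, not found" : "found";
    }
}
```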

// Cache for item URL lookups - persists across clearCache() calls since items are used for metadata only
private final Map<String, Item> itemsByUrl = new HashMap<>();
private final Set<String> checkedItemUrls = new HashSet<>();
private boolean skipDbItemLookups = false;
Member

While it's explained on the setter (which is a weird place for it), an explanation of the variable here would seem useful.

Comment on lines +30 to +31
* In-memory cache layer for the OCL import pipeline. A new instance is created per import run
* in Importer.processInput() and is garbage collected when the import completes.
Member

Suggested change
* In-memory cache layer for the OCL import pipeline. A new instance is created per import run
* in Importer.processInput() and is garbage collected when the import completes.
* In-memory cache layer for the OCL import pipeline. A new instance is created per import run
* and is garbage collected when the import completes.

Comment on lines +33 to +37
* Most entity caches (concepts, conceptMaps, etc.) are cleared on each flush/clear cycle via
* {@link #clearCache()}. The item URL caches ({@code itemsByUrl}, {@code checkedItemUrls}) grow
* monotonically for the lifetime of the import since they must persist across batches and phases.
* For typical imports (~15K items), this is a few MB. For very large imports (hundreds of thousands
* of items), memory usage should be monitored.
Member

I'm not entirely convinced that this bunch of text is helpful as a Javadoc comment. It's not incorrect, it's just not necessarily helpful.

Suggested change
* Most entity caches (concepts, conceptMaps, etc.) are cleared on each flush/clear cycle via
* {@link #clearCache()}. The item URL caches ({@code itemsByUrl}, {@code checkedItemUrls}) grow
* monotonically for the lifetime of the import since they must persist across batches and phases.
* For typical imports (~15K items), this is a few MB. For very large imports (hundreds of thousands
* of items), memory usage should be monitored.

Comment on lines +58 to +62
// Number of items to process before flushing/clearing the Hibernate session.
// Higher values reduce flush/clear cycles but increase session memory usage
// (each concept carries names, descriptions, mappings, etc.). The original value
// was 256; 512 is a moderate increase that balances fewer cycles with memory
// safety across varied OpenMRS deployment environments.
Member

This is the kind of AI-generated comment I've been at war with. It's half-helpful with a bunch of additional notes that aren't terribly helpful:

Suggested change
// Number of items to process before flushing/clearing the Hibernate session.
// Higher values reduce flush/clear cycles but increase session memory usage
// (each concept carries names, descriptions, mappings, etc.). The original value
// was 256; 512 is a moderate increase that balances fewer cycles with memory
// safety across varied OpenMRS deployment environments.
// Number of items to process before flushing/clearing the Hibernate session.

Unless we're going to control this via a GP (which may actually be an OK idea), I don't think a lot of notes on what this was, etc. help.

Comment on lines +84 to +85
* When validationType is non-null, it is used directly instead of looking up the subscription each time.
* This avoids repeated getSubscription() calls for every concept in the import.
Member

Suggested change
* When validationType is non-null, it is used directly instead of looking up the subscription each time.
* This avoids repeated getSubscription() calls for every concept in the import.

Member

It would be better here to have a note on what happens when ValidationType is null since that's not particularly obvious.

